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Abstract: We study the nonparametric covariance estimation of a stationary Gaussian field X 
observed on a lattice. To tackle this issue, a neighborhood selection procedure has been recently 
introduced. This procedure amounts to selecting a neighborhood to by a penalization method 
and estimating the covariance of X in the space of Gaussian Markov random fields (GMRFs) 
with neighborhood m. Such a strategy is shown to satisfy oracle inequalities as well as minimax 
adaptive properties. However, it suffers several drawbacks which make the method difficult to apply 
in practice. On the one hand, the penalty depends on some unknown quantities. On the other hand, 
the procedure is only defined for toroidal lattices. The present contribution is threefold. A data- 
driven algorithm is proposed for tuning the penalty function. Moreover, the procedure is extended 
to non-toroidal lattices. Finally, numerical study illustrate the performances of the method on 
simulated examples. These simulations suggest that Gaussian Markov random field selection is 
often a good alternative to variogram estimation. 

Key-words: Gaussian field, Gaussian Markov random field, Data-driven calibration, model 
selection, pseudolikelihood. 
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Selection automatique de voisinage d'un champ gaussien 



Resume : Nous etudions l'estimation non-parametrique d'un champ gaussien stationnaire X 
observe sur un reseau r egulier. Dans ce cadre, nous avons precedemment introduit une procedure 
de selection de modele Ver09j . Cette procedure revient a selectionner un voisinage fh grace une 



technique de penalisation puis a estimer la covariance du champ X dans l'espace des champs de 
Markov gaussiens de voisinage fh. Une telle strategic satisfait des inegalites oracles et des proprietes 
d'apdaptation au sens minimax. En pratique, elle presente neanmoins quelques inconvenients. 
D'une part, la penalite depend de quantites inconnues. D'autre part, la procedure est uniquement 
definie pour des reseaux toriques. La contribution de cet article est triple. Nous proposons un 
algorithme automatique pour calibrer la penalite. De plus, nous introduisons une extension a des 
reseaux non-toriques. Enfin, nous etudions les performances pratiques de la procedure sur des 
donnees simulees. Ces simulations suggerent que la selection de champs de Markov gaussiens est 
souvent une bonne alternative a l'estimation de variogramme. 

Mots-cles : Champ gaussien, champ de Markov gaussien, calibration automatique, selection de 
modele, pseudo-vraisemblance. 
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1 Introduction 

We study the estimation of the distribution of a stationary Gaussian field {X[i,j})^ ^ £ ^ indexed by 
the nodes of a rectangular lattice A of size p\ x p 2 . This problem is often encountered in spatial 
statistics or in image analysis. Classical statistical procedures allow to estimate and subtract the 
trend. Henceforth, we assume that the field X is centered. Given a rt-sample of the field X, the 
challenge is to infer the correlation. In practice, the number n of observations often equals one. 
Different methods have been proposed to tackle this problem. 

A traditional approach amounts to computing an empirical variogram and the n fittin g a suit- 
able p arametric variogram model such as the exponential or Matern model (see [Cre93| Ch.2 or 
[Ste99l |). The main disadvantage with this method is that the practitioner is required to select a 
good variogram model. When t he field exhi bits long range dependence, specific procedures have 
been introduced (e.g. Frlas et al. [FARMA"08l0 . I n the sequel, we focus on small ra nge dep endences. 
Most of the nonparametric (Hall et al. |HFH94 |) and semiparametric (Im et al. ISZ07j ) methods 



are based on the spectral representation of the field. To our knowledge, these procedures have 
not yet been shown to achieve adaptiveness, i.e. their rate of convergence does not adapt to the 
complexity of the correlation functions. 

In this paper, we define and study a nonparametric estimation procedure relying on Gaussian 
Markov random fields (GMRF) . This procedure is computationally fast and satisfies adaptive prop- 
erties. Let us fix a node (0,0) at the center of A and let to be a subset of A \ {(0,0)}. The field 
X is a GMRF with respect to the neighborhood to if conditionally to PQM])(fc,J)em> the v ariable 
X[o,o] is independent from all the remaining variables in A. We refer to Rue and Held [RH05l | 
for a comprehensive introduction on GMRFs. If we know that AT is a GMRF with respect to the 
neighborhood to, then we can estimate the covariance by applying likelihood or pseudolikelihood 
maximization. Such para metric procedures are well understood, at least from an asymptotic point 
of view (see for instance [Guy95l | Sect. 4). However, we do not know in practice what is the "good" 
neighborhood to. For instance, choosing the empty neighborhood amounts to assuming that all the 
components of X are independent. Alternatively, if we choose the complete neighborhood, which 
contains all the nodes of A except (0,0), then the number of parameters is huge and estimation 
performances are poor. 

We tackle in this paper the problem of neighborhood selection from a practical point of view. 
The purpose is to define a data-driven procedure that picks a suitable neighborhood fh and then 
estimates the distribution of X in the space of GMRFs with neighborhood to. This procedure 
neither requires any knowledge on the correlation of X, nor assumes that the field X satisfies a 
Markov condition. Indeed, the procedure selects a neighborhood m that achieves a trade-off between 
an approximation error (distance between the true correlation and GMRFs with neighborhood to) 
and an estimation error (variance of the estimator). If X is a GMRF with respect to a small 
neighborhood, then the procedure achieves a parametric rate of convergence. Alternatively, if X is 
not a GMRF then the rate of convergence of the procedure depends on the rate of approximation of 
the true covariance by GMRFs with growing neighborhood. In short, the procedure is nonparametric 
and adaptive. 

Besag and Koope rberg [BK95l |. Rue and Tjelmeland RT02], Song et al. |SFG08l |. and Cressie 



and Verzelen have considered the problem of approximating the correlation of a Gaussian 

field by a GMRF, but thi s approach requires the knowledge of the true distribution. Guyon and 
Yao have stated in necessary conditions and sufficient conditions for a model selection pro- 
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cedure to choose asymptotically the true neighborhood of a GMRF with probability one. Our 
point of view is slightly different. We do not assume that the field X is a GMRF with respect to 
a sparse neighborhood. We do not aim at estimating the true neighborhood, we rather want to 
select a neighborhood that allows to estimate well the distribution of X (i.e. to minimize a risk). 
The distinction betwe en thes e two points of view has been nicely described in the first chapter of 
MacQuarrie and Tsai |MT98| . 



In (VerO^ . we have introduced a neighborhood selection procedure based on pseudolikelihood 
maximization and penalization. Under mild assumptions, the procedure achieves optimal neigh- 
borhood selection. More precisely, it satisfies an oracle inequality and it is minimax adaptive to 
the sparsity of the neighborhood. To our knowledge, these are the first results of neighborhood 
selection in this spatial setting. 

If the procedure exhibits appealing theoretical properties, it suffers several drawbacks from a 
practical perspective. First, the method constrains the largest eigenvalue of the estimated covari- 
ance to be smaller than some parameter p. In practice, it is difficult to choose p since we do not 
know the largest eigenvalue of the true covariance. Second, the penalty function pen(.) introduced 
in Sect. 3 of the previous paper depends on the largest eigenvalue of the covariance of the field X . 
Hence, we need a practical method for tuning the penalty. Third, the procedure has only been 
defined when the lattice A is a square torus. 

Our contribution is twofold. On the one hand, we propose practical versions of our neighborhood 
selection procedure that overcome the previously-mentioned drawbacks: 

• The procedure is extended to rectangular lattices. 

• We do not constrain anymore the largest eigenvalue of the covariance. 



• We provide an algorithm based on the so-called slope heuristics of Birge and Massart [BM07 | 
for tuning the penalty. Theoretical justifications for its use are also given. 

• Finally, we extend the procedure to the case where the lattice A is not a torus. 

On the other hand, we illustrate the performances of this new procedur e on nu merical ex amples . 
When A is a torus, we compare it with likelihood-based methods like AIC Aka73| and BIC |Sch78l |. 



even if they were not studied in this setting. When A is not toroidal, likelihood methods become in- 
tractable. Nevertheless, our procedure still applies and often outperforms variogram-based methods. 

The pape r is org anized as follows. In Section [2j we define a new version of the estimation 
procedure of [Ver09l | that does not require anymore the choice of the constant p. We also discuss 
the computational complexity of the procedure. In Section we connect this new procedure to 
the original method and we recall some theoretical results. We provide an algorithm for tuning the 
penalty in practice in Section 01 In Section we extend our procedure for handling non-toroidal 
lattices. The simulation studies are provided in Section[6l Section [7] summarizes our findings, while 
the proofs are postponed to Section [HI 

Let us introduce some notations. In the sequel, X v refers to the vectorialized version of X with 
the convention X[i,j] = X v [(i-i)xp 2 +j] for any 1 < i < p\ and 1 < j < p2- Using this new notation 
amounts to "forgetting" the spatial structure of X and allows to get into a more classical statistical 
framework. We note Xi,X 2 , . . . ,X n the n observations of the field X. The matrix X stands for 
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the covariance matrix of X v . For any matrix A, ip max (A) and tpm\ B (A) respectively refer the largest 
eigenvalue and the smallest eigenvalues of A. Finally, I r denotes the identity matrix of size r. 



2 Neighborhood selection on a torus 

In this section, we introduce the main concepts and notations for GMRFs on a torus. Afterwards, 
we describe our procedure based on pseudolikelihood maximization. Finally, we discuss some com- 
putational aspects. Throughout this section and the two following sections, the lattice A is assumed 
to be toroidal. Consequently, the components of the matrices X are taken modulo p\ and p2- 



2.1 GMRFs on the torus 

The notion of conditional distr ibution is underlying the definition of GMRFs. By standard Gaussian 
derivations (see for instance Lau96| App.C), there exists a unique p\ x p2 matrix 9 such that 
6[o,o] = and 

X[o,o] = 0[i,j]X[i,j] + e[o,o] , (1) 

W)eA\{(o,o)} 

where the random variable e[o,o] follows a zero-mean normal distribution and is independent from 
the covariates (-X"[».j'])(i,;;)eA\{(o,o)}- The linear combination S(i,j)eA\{(o o)} ^M-^Wl ls the kriging 
predictor of X[o,o] given the remaining variables. In the sequel, we note cr 2 the variance of e[o,o] and 
we call it the conditional variance of X[o,o]. 

Equation {TJ describes the conditional distribution of X[o,o] given the remaining variables. By 
stationarity of the field X, it holds that that 6[i,j] = 9[-i,-j\. The covariance matrix £ is closely 
related to 9 through the following equation: 

£ = o* II P1P 2 - C(B)]~ 1 , (2) 

where the pip2 X pip2 matrix C{9) is defined by C(9)[(i 1 -i)p 2 +ji,(i2-i)P2+j2] := 9[i 2 -i 1 ,j 2 -j 1 ] for any 
1 < *i,*2 < Pi and 1 < ji,j2 < P2- The matrix (I PlP2 — C(9)) is called the partial correlation 
matrix of th e field X . The so-defined matrix C{9) i s symm etric block circulant with P2 X P2 blocks. 
We refer to Sect. 2. 6 or the book of Gray |Gra06| for definitions and main properties on 

circulant and block circulant matrices. 



Identities {J) and (J2]) have two main consequences. First, estimating the p\ x p 2 matrix 9 
amounts to estimating the covariance matrix E up to a multiplicative constant. We shall therefore 
focus on 9. Second, by Equation {TJ, the field X is a GMRF with respect to the neighborhood 
defined by the support 9. The adaptive estimation issue of the distribution of X by neighborhood 
selection therefore reformulates as an adaptive estimation problem of the matrix 9 via support 
selection. 

Let us now precise the set of possible values for 9. The set 8 denotes the vector space of the 
Pi xp2 matrices that satisfy 0[o,o] = and 9[i,j] = 9[-i,-j], for any € A. Hence, a matrix 9 £ 9 
corresponds to the distribution of a stationary Gaussian field if and only if the P1P2 x P1P2 matrix 
(^PiP2 — C(0)) is positive definite. This is why we define the convex subset 6 + of by 

6 + := {fe9 s.t. [I P1P2 - C{9)] is positive definite} . (3) 
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The set of covariance matrices of stationary Gaussian fields on A with unit conditional variance is 
in one to one correspondence with the set 9 + . We sometimes assume that the field X is isotropic. 
The corresponding sets 6 lso and 6 + ' 1S0 for isotropic fields are introduced as: 

6 iso := {0 G 9 , 6[ij] = 6[-ij] = 9[j,i] , V(i, j) G A} and 9+< iso := 6+ n 9 iso . 

2.2 Description of the procedure 

Let |(i,j)|t refer to the toroidal norm defined by 

:= [i A (pi - i)] 2 + y A (p 2 - j)] 2 , 

for any node £ A. 

In the sequel, a model to stands for a subset of A \ {(0,0)}. It is also called a neighborhood. 
For the sake of simplicity, we shall only use the collection of models Mi defined below. 

Definition 2.1. A subset m C A \ {(0,0)} belongs to Mi if and only if there exists a number 
r m > 1 such that 

m = {(i,i)eA\{(0,0)} s.t. \(i,j)\ t <r m } . (4) 

In other words, the neighborhoods to in A4i are sets of nodes lying in a disc centered at (0, 0). 
Obviously, A^i is totally ordered with respect to the inclusion. Consequently, we order the models 
too C mi C.Cra,.... For instance, too corresponds to the empty neighborhood, toi stands for 
the neighborhood of size 4, and TO2 refers to the neighborhood with 8 neighbours. See Figure Q] for 
an illustration. 



a) 




Figure 1: (a) Model mi with first order neighbors, (b) Model iri2 with second order neighbors, (c) 
Model TO3 with third order neighbors. 

For any model to G Mi, the vector space m is the subset of matrices whose support is 
included in to. Similarly 9^° is the subset of 1SO whose support is included in to. The dimensions 
of 9 m and 0'^° are respectively noted d m and g^°. Since we aim at estimating the positive matrix 
(I PlP2 — C(9)), we also consider the convex subsets of 9+ and 9+' 1S0 which correspond to non- 
negative precision matrices. 

9+ := Q m n 9+ and 9,+< iso := 9^° n 9+< iso . (5) 
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For any 6' € , the conditional least-squares (CLS) criterion 7„ Pl p, (#') |Guv87l | is defined 

by 

I " / \ 2 

ln, Pl , P2 (0') := E E ( x < X! tf'Pi.WXi^A-Hd) ■ (6) 

UPlP2 <=i 0u2)6A V (Ji,i 2 )eA\{(o,o)} ' 

The function 7n,pi,p 2 (-) i s a least-squares criterion that allows us to perform the simultaneous lin- 
ear regression of all ~Ki[j 1 ,j 2 ] with respect to the covariates CX-jlh ,i 2 ])n, h )=£(k, >,) ■ This criterion is 
closely connected with the pseudolikelihood introduced by Besag [Bes75l |. The as sociated estimator 
is slightly less efficient estimator than maximum likelihood estimation ( |Guy95l | Sect. 4. 3). Never- 
thel ess, its computation is much faster since it does not involve determinants as for the likelihood. 
See |Ver09l | Sect. 7.1, for a more complete comparison between CLS and maximum likelihood esti- 
mators in this setting. For any model me Mi, the estimators are defined as the unique minimizers 
of 7n,pi,p 2 (-) on the sets 9+ and 0^ IS0 . 

9 m := arg min 7 n , Pl , p ,(6>') and := arg min 7„, P1 iP2 (9') , (7) 

where A stands for the closure of A. We further discuss the connection between t/ m and "rn.pi in 
Section [3l 

Given a subcollection of models M of M.\ and a positive function pen : Ai — ► R + called a 
penalty, we select a model as follows: 



m := arg mm 



7n,pi,p 2 [6m ) + pen(m) and m lso := arg min 7„, Pl , P2 ( 0™ ) + pen(m) . (8) 

V / J mGA-1 L V / 



For short, we write 9 and 9 lso for 6*„ and 0J~° BO - We discuss the choice of the penalty function in 
Section [H 



2.3 Computational aspects 

Since the lattice A is a torus, the computation of the estimators 9 m is performed efficiently thanks 
to the following lemma. 

Lemma 2.1. For any p x p matrix A and for any 1 < i < p\ and 1 < j < p%, let X[i,j](A) be the 
(i,j)-th term of two-dimensional discrete Fourier transform of the matrix A, i.e. 

pi Pi 

X[i,j](A) := EE^ [i,j]exp 
fe=i i=i 

where l 2 = — 1. The conditional least-squares criterion 7n,pi,p 2 (^') simplifies as 

1 1P ' 2 ^ (=1 j=l 

A proof is given in Section [8l Optimization of 7n, Pl ,p 2 (-) over the set 0+ is performed fastly 
using the fast Fourier transform (FFT). Nevertheless, this is not the privilege of CLS estimators, 
since maximum likelihood estimators are also computed fastly by FFT when A is a torus. 

In Section [5j we mention that the computation of the CLS estimators 9 m remains quite easy 
when A is not a torus whereas likelihood maximization becomes intractable. 



, ki jl 
2ltt\ h — 

yPl P2 



(9) 



X A '-< PWm (x fc ) 
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3 Theoretical results 

Throughout this section, A is assumed to be a toroidal square lattice and we note p its size. Let us 
mention that the restriction to square lattices made in allows to simplify the proofs but is 

not necessary so that the theoretical results hold. In this section, we first recall the original proce- 
dure and we emphasize the differences with the one defined in the previous section. We also mention 
a result of optimality. This will provide some insights for calibrating the penalty pen(.) in Section [4j 

Given p > 2 be a positive constant, we define the subsets 9+ „ and by 

0+, := {9eO+, [I P1P2 - C(9)} < p) (10) 

0^p S ° := {0e 9+- iso , vw [I P1P2 - C(9)} < p) . 

Then, the corresponding estimators 9 mp and 6>J^° are defined as in |(7|), except that we now consider 

0+ p inste ad of 6 + ■ Let us mention that the estimator 9 m corresponds to the estimator # mjPl 
defined in Ver09l | Sect. 2. 2 with p\ = +oo. 



d m ., P ■= arg min j n . p . p (0') and 9^ p := arg min y n . p . p (9') . 

Given a subcollection M. of M.\ and a penalty function pen(.), we select the models fh p and rfi™ 
as in ((Sj) except that we use 6*„ ljP and 9™° p instead of 9 m and 9™°. We also note 9 P and 9 p so for 
and ' 



m p ,p 



The only difference between the estimators 9 and 9 p is that the largest eigenvalue of the preci- 
sion matrix (I p 2 — C{9)) is restricted to be smaller than p. We make this restriction in Ver09l | to 
facilitate the analysis. 

In order to assess the performance of the penalized estimator 9 P and 9 p ao , we use the prediction 
loss function 1(9 1,92) defined by 

l(9 1 ,6 2 ) := \tr \(C(9 X ) - C{9 2 ))^{C{9 1 ) - C(9 2 ))} . (11) 

As explained in Ver09j Sect. 1.3, the loss l(6x, 9 2) expresses in terms of conditional expectation 

Z(0!,0 2 ) =E e {[E fll (X[o,o]|X M{0i0} ) -E 9 . 2 (A[o,o]|A AUo , 0} )] 2 } , (12) 

where Eg(.) stands for the expectation with respect to the distribution Af(0,a 2 (I PlP2 — C(9))~ 1 ). 
Hence, 1(9, 9) corresponds the mean squared pre diction l oss of X[o,o] given the other covariates. A 
similar loss function is also used by Song et al. jSFG08| . when approximation Gaussian fields by 
GMRFs. For any neighborhood m £ M, we define the projection 9 miP as the closest element of 9 
in ®m, P with respect to the loss l(., .). 

m , p :=arg min 1(6' ,6) and 9™ p := arg min 1(6', 9) . 
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We call the loss l(9 miP ,9) the bias of the set 0+ „. This implies that 9 m ,p cannot perform better 
than this loss. 

Theorem 3.1. Let p > 2, K be a positive number larger than an universal constant Kq and M. be 
a subcollection of M \ . If for every model m £ M, it holds that 

pen{m) > Kp 2 tp max (Y<) dm + 1 , (13) 

np z 

then for any 9 £ + , the estimator 9 P satisfies 

E e [l(6 p ,9)]<L(K) inf [l(9 miP ,6) + pen(m)} , (14) 

where L(K) only depends on K . A similar bound holds if one replaces 9 p by 9™°, + by Q+- lso ! 
Om,P h C° P ; and d m by tf™. 

Although we have assumed the correlation is non-singular, the theorem still holds if the spa tial 
field is constant. The nonasymptotic bound is provided in a slightly different version in [VerD^ . It 



states that 9 P achieves a trade-off between the bias and a variance term if the penalty is suitable 
chosen. In Theorem 13.11 we use the penalty Kp 2 ip max (T,)(d m + l)/(np 2 ) instead of the penalty 
K p 2 ip max (Y.)d m / (np 2 ) stated in the previous paper. This makes the bound (fT4|) simpler. Observe 
that these two penalties yield the same model selection since they only differ by a constant. Let us 
further discuss two points. 

• In this paper, we use the estimator 9 rather than 9 p . Given a collection of models M., there 
exists some finite p > 2, such that these two estimators coincide. Take for instance p = 
svjp m(£M sup eGe + <y9 m ax(^pip2 — Admittedly, the so-obtained p may be large, especially 
if there are large models in M. The upper bound (fT4|) on the risk therefore becomes worse. 
Nevertheless, we do not think that the dependency of fl4|) on p is sharp. Indeed , we illustrate 
in Section [6] that the risk of 9 exhibits good statistical performances. 

• Theorem l3.1l provides a suitable form of the penalty for obtaining oracle inequalities. However, 
this penalty depends on (p max (S) which is not known in practice. This is why we develop a 
data-driven penalization method in the next section. 



4 Slope Heuristics 

Let us introduce a data-driven method for calibrating the pe nalty fu nction pen(.). It is based on 
the so-called slope heuristic int roduced by Birge and Massart |BMn7l | in the fixed design Gaussian 



regression framework (see also [Mas07l | Sect. 8. 5. 2). This heuristic relies on the notion of minimal 
penalty. In short, assume that one knows that a good penalty has a form pen(m) = NF(d m ) 
(where d m is the dimension of the model and N is a tuning parameter). Let us define fh(N) the 
selected model as a function of N. There exists a quantity A^ m ; n satisfying the following property: If 
N > N m i n , the dimension of the selected model dfh(N) is reasonable and if N < N m \ n , the dimension 
of the selected model is huge. The function pen m i n (.) := N m - m F(.) is called the minimal penalty. 
In fact, a dimension jump occurs for dfn(N) a t the point A^in. Thus, the quantity Af m j n is clearly 
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observable for real data sets. In their Gaussian framework, Birge and Massart have shown that 
twice the minimal penalty is nearly the optimal penalty. In other words, the model rh := TO(27V m j n ) 
yields an efficient estimator. 

The slope heuristic method has been successfully applied for multiple change-point detection 
[Leb05l |. A pplicatio ns are also being developed in othe r frameworks s uch as m ixture models MMQgl ]. 
clustering [BCM08], estimation of oil reserves Lep02| . and genomic Vil07 |. 

If th is method was originally introduced for fixed design Gaussian regression, Arlot and Massart 
AM09l | have proved more recently that a similar phenomenon occurs in the heteroscedastic random- 



design case. In the GMRF setting, we are only able to partially justify this heuristic. For the sake 
of simplicity, let us assume in the next proposition that the lattice A is a square of size p. 

Proposition 4.1. Consider p > 2, and rj < 1 and suppose that p is la rger than some numerical 
constant po- Let rn! be the largest model in M.\ that satisfies d m > < \j np 2 . For any model m e Mi, 
we assume that 

pen{m') - pen(m) < Kx{l - nV {<p min (l p 2 - C(6)) A [p - y max (l p 2 - C{6))] } ^'^f™ , (15) 
where K\ is a universal (constant defined in the proof). Then, for any 9 £ 9+, , it holds that 



> L 



m' ,p i 

i 



> 

~ 2 



where L only depends on r\, p, <p m i n (l p 2 — C(9)) , and (p max (ip 2 — C(6)) . 
The proof is postponed to Section [8l Let us define 

Ni := K lt j 2 {^ min {I P1P2 ~ C{6)) A [p - </w {I P1P2 - C(0))]} , 

and let us consider penalty functions pen(m) = N np ^ for some N > 0. The proposition states 
that if N is smaller than Ni, then the procedure selects a model of huge dimension with large 
probability, i.e dm(JV) is huge. Alternatively, let us define 

N 2 := Ka dm 



(I P1P2 - C(6)) npip 2 

where the numerical constant Kq is introduced in Theorem 3.1 in [Verf)d |. By Theorem 3.1, choosing 
N > N2 ensures that the risk of 8 p achieves a type-oracle inequality and the dimension dft p (Af) is 
reasonable. The quantities N\ and N2 are different especially when the eigenvalues of {I PlP2 —C(8)) 
are far from 1. Since we do not know the behavior of the selected model rh p (N) when N is between 
Ni and N2, we are not able to really prove a dimension jump as the fixed design Gaussian regression 
framework^ Besides, we have mentioned in the preceding section that we are more interested in the 
estimator 9 than 8 p . Nevertheless, we clearly observe in simulation studies a dimension jump for 
some N between Ni and N 2 even if we use the estimators 8 m instead of 9 m p . This suggests that 
the slope heuristic is still valid in the GMRF framework. 

Algorithm 4.1. (Data-driven penalization with slope heuristic). Let M. be a subcollection of M.\. 
1. Compute the selected model fh(N) as a function of N > 



fh{N) G arg min {7^1, P2 ((>m) + N — \ 

m£M y V / TipiP2 J 
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> such that the jump d^(,~ i -\ — eL/r~ -, \ is maximal. 



([Jv mi „]_) a ™(N + ) 



3. Select the model m = fh(2N m - m ) . 



The difference f(x~) — f{x+) measures the discontinuity of a function / at the point x. Step 
2 may need to introduce huge models in the collection M. all the other ones being considered as 
"reasonably small". As the function m(.) is piecewise linear with at most C ard(.M) jumps, so that 
steps 1-2 have a complexity O (Card(Al)) 2 . We refer to App.A.l of AM09l | for more details on the 



computational aspects of steps 1 and 2. Let us ment ion tha t there are other ways of estimating 
N min than choosing the largest jump as described in AM09l | App.A.2. Finally, the methodology 



described in this section straightforwardly extends to the case of isotropic GMRFs estimation by 
replacing fh{N) by m 1S0 (N) and d m by c? 1 ^ . 

In conclusion, the neighborhood selection procedure described in Algorithm 14.11 is completely 
data-driven and does not require any prior knowledge on the matrix £. Moreover, its computational 
burden remains small. We illustrate its efficiency in Section [H 



5 Extension to non-toroidal lattices 

It is often artificial to consider the field X as stationary on a tor us. Ho wever, we needed this 
hypothesis for deriving nonasymptotic properties of the estimator 9 in In many applications, 

it is more realistic to assume that we observe a small window of a Gaussian field defined on the 
plane Z 2 . If we are unable to prove no nasym ptotic risk bounds in this new setting. Nevertheless, 
Lakshman and Derin have shown in that there is no phase transition within the valid 



parameter space for GMRFs defined on the plane Z 2 . Let us briefly explain what this means: 
consider a GMRF defined on a square lattice of size p, but only observed on a square lattice of 
size p' . The absence of phase transition implies the distribution of this field observed on this fixed 
window of size p' does not asymptotically depend on the bound conditions when p goes to infinity. 
Consequently, it is reasonable to think that our estimation procedure still performs well to the price 
of slight modifications. In the sequel, we assume that the field X is defined on Z 2 , but the data X 
still correspond to n independent observations of the field X on the window A of size p\ x p2. The 
conditional distribution of A[o,o] given the remaining covariates now decomposes as 

X[o,o] = 0[i,j]X[i,j] + e[o,o] , (16) 

(i,i)ez=\{(o,o)} 

where 6[.,.] is an "infinite" matrix defined on Z 2 and where e[o,o] is a centered Gaussian variable of 
variance a 2 independent of ( J X"[i,j])(i,_ 7 -)eA\{(o,o)}- The distribution of the field X is uniquely defined 
by the function 9 and positive number a 2 . The set 9 + '°° of v alid pa rameter for 9 is now defined 
using the spectral density function. We refer to Rue and Held [RHOo] Sect. 2. 7 for more details. 

Definition 5.1. A function 9 : Z 2 — > R belongs to the set 0+<°° if it satisfies the three following 
conditions: 

1. 0[o,o] = 0. 

2. For any G Z 2 , 9[i,j] = 6[-i,-j]. 
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3. For any Oi,w 2 ) G [0,2tt) 2 , 1 - cos (iwi + juj 2 ) > 0. 

Similarly, we define the set 0+<°°' lso for the isotropic GMRFs on the lattices. As done in Section 
[2]for toroidal lattices, we now introduce the parametric parameter sets. For any model m G Mi, the 
set 6^'°° refers to the subset of matrices 9 in 0+'°° whose support is included in to. Analogously, 
we define the parameter set Q m ,oc,1S0 corresponding to isotropic GMRFs. 

We cannot directly extend the CLS empirical contrast 7n,pi, P2 (-) defined in ((6j) in this new 
setting because we have to take the edge effect into account. Indeed, if we want to compute 
the conditional regression of Xjji ,j 2 ], we have to observe all its neighbors with respect to to, i.e. 
{Xf[ji+Ji,j2+J2]) (h,h) G m}. In this regard, we define the sublattice A m for any model m G Mi. 

A m := {(ii,i 2 ) G A, (m + (11,12)) C A} , 

where (m + denotes the set to of nodes translated by (i,j). For instance, if we consider the 

model n%i with four nearest neighbors, the edge effect size is one and A m contains all the nodes 
that do not lie on the border. The model TO3 with 12 nearest neighbors yields an edge effect of 
size 2 and A TO contains all the nodes in A, except those which are at a (euclidean) distance strictly 
smaller than 2 from the border. 

For any model m G Mi, any 8' G 8+ ,oc , and any sublattice A' C A m , we define 7 A . piiP2 (-) as an 
analogous of 7n,pi,p 2 (-) except that it only relies on the conditional regression of the nodes in A'. 



7 



.A' 

n,vuP^ ) ■ n Card(A') 

A',iso 



Then, the CLS estimators 8^ and 8^ ' iso are defined by 

9% G arg min ^ (^) and Q^ is ° G arg min . «fi (8') . 

Contrary to m , the estimator 8^ is not necessarily unique especially if the size of A m is smaller 
than d m - Let us mention that it is quite classic al in th e literature to remove nodes to take edge 
effects or missing data into account (see e.g. |Guy95l | Sect. 4. 3). We cannot use anymore fast 
Fourier transform for computing the parametric estimator. Nevertheless, the estimators 8 m are 
still computationally amenable, since they minimizes a quadratic function on the closed convex set 

0+,oo 

Suppose we are given a subcollection M of Mi. We note Km the smallest sublattice among 
the collection of lattices A TO with m € M. In order to select the neighborhood to, we compute 
the estimators 8 m M and minimize the criteria TjJ^ P2 (^m M ) penalized by a quantity of the order 

d TO /(nCard(Ajn)). We compute the quantities 7n^ ltP2 (S m M ) instead of 7^p 1)P2 (^m m ) since we want 
to compare the adequation of the models using the same data set. 

We now describe a data-driven model selection procedure for choosing the neighborhood. It is 
based on the slope heuristic developed in the previous section. 

Algorithm 5.1. (Data-driven penalization for non-toroidal lattice). 

1. Compute the selected model m(N) as a function of N > 

m(N) G arg min \l^ UP Mn M ) + N 



meM { '«.pi,P2v m / n Card{A M ) 
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2. Find N m i n > such that the jump d^/^~ j ^ — ^m(Jiv ] ) ?s max ^ ma ^- 

3. Select the model fh = m(2N min ). 
4- Compute the estimator #,~ m . 

This procedure straightforwardly extends to the casejrf isotropic GMRFs estimation by replacing 
m(N) by m iso (N) and d m by d™. For short, we write 9 (resp. 9 iso ) for 6^ (resp. 0~™' iso ). As for 
Algorithm l4.lt it ls advised to introduce huge models in the collection M. in order to better detect 
the dimension jump. However, when the dimension of the models increases the size of A m decreases 
and the estimator 9^ may become unreliable. The method therefore requires a reasonable number 
of data. In practice, A should not contain less than 100 nodes. 



6 Simulation study 

In the first simulation experiment, we compare the efficiency of our procedure with penalized maxi- 
mum likelihood methods when the field is a torus. In the second and third studies, we con sider th e 
estimation of a Gaussian field observed on a rectangle. The calculations are made with R R^D08]. 
Throughout these simulations, we only consider isotropic estimators. 



6.1 Isotropic GMRF on a torus 

First, we consider X an isotropic GMRF on the torus A of size p = pi = p2 = 20. There are 
therefore 400 points in the lattice. The number of observations n equals one and the conditional 
variance a 2 is one. We introduce a radius r := \/T7 '. Then, for any number 4> > 0, we define the 
p x p matrix 9^ as: 

6>>,o] := , 

0+[ij\ := <j> if \{i,j)\ t < r and ± (0,0) , 
9\i,i\ := if |(i,j)| t >r . 

In practice, we set <f> to 0, 0.0125, 0.015, and 0.0175. Observe that these choices constrain ||i < 1. 
The matrix 9^ therefore belongs to the set 0^™ of dimension 10 introduced in Definition 12.11 

First simulation experiment. In Section [31 we have advocated the use of the estimator 9 
instead of 9 p , although theoretical results are only available for 9 p with p < oo. We recall that 
9 = 9 p with p = oo. We check in this simulation study that the performances of 9 and 9 P with 
different values of p are similar. 

We consider the collection of neighborhoods M := {mo, mi, . . . , m,2o} whose maximal dimension 
df r ° is 21. The estimator 9 lso is built using the CLS model selection procedure introduced in 
Algorithm 14.11 The estimators 9 p so are computed similarly, except that they are based on the 

parametric estimators 6*^° p (Sect. [3]) instead of 9™° . 

The Gaussian field X with (j) = 0.015 is simulated by using the fast Fourier transform. The 
quality of the estimations is assessed by the prediction loss function l(., .) defined in (fTTj) . The 
experiments are repeated 1000 times. For p = 2, 4, 8, we evaluate the risks E e « [l(9' lso , 9^)] and 
Eg* [l(9 p so , 9^)] as well as the corresponding empirical 95% confidence intervals by a Monte-Carlo 
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uate the oracle risks E e *[7(0J! 



method. We also estimate the risks of 0J^ and 9^° p for each model m S M.. It then allows to eval- 

9^,6^)] and the risk ratios E e 4l(9^°, 0^/E^ [Z(0£° pl 6*)]. The risk 
ratio measures how well the selected model m lso performs in comparison to the "best" model m*. 
Moreover, the risk ratio roughly illustrates the oracle type inequality presented in Theorem l3.ll In- 
deed, the infimumjnfmgTvi [l(9 m ,p, 0) +pen(m)] in (14) is a good measure of the risk Eg* [Z(0^,° p , 9^)] 
as explained in Ver09l | Sect. 4. The results are given in Table [TJ They corroborate that the esti- 
mators iso and 0jf perform similarly. Moreover, the risk ratios Eg* [0*°, )]/E e « [Z(0£° % 
correspond to the ratios 



p 


2 


4 


8 


oo 


E e 4l(9 p s °,9*)]/E e4 


)] x 10 2 


4.1 ±0.1 
1.3 ±0.1 


4.2 ±0.2 

1.3 ±0.1 


4.2 ±0.1 

1.3 ±0.1 


4.2 ±0.3 

1.3 ±0.2 



Table 1: First simulation study. Estimates and 95% confidence intervals of the risks Eg* [Z(0 1SO , 
E e , [1(9™, 0*)], and of the ratios E « [Z(0 iso , 0*)]/E e * [1(9™, , 9*)] and E e4 , [1(9™, 9*)}/E e * [1(9™, >p 
with 4> = 0.015 and p = 2, 4, 8. 



Second simulation experiment. We compare the efficiency of the method with two alter- 
native model selection procedures. For each of them, we use the collection M as in the previous 
experiment. The two alternative procedures are based on likelihood maximization. In this regard, 
we first define the parametric maximum likelihood estimator 0™ le for any model me M, 



imle grille 



aT :=arg min -£ p (0', </, X) 

0'ee+' ,s< V' 



where C p (9' , X) stands for the log-lik elihood at the parameter 9'. We then select a model m applying 
either an AlC-type criterion Aka73| or a BIC-type criterion [Sch78| : 



m AlG 



m BIC 



arg min {-2C P (9^, a™ le , X) + 2<c} , 
arg min {-2C p (6^,a^ e , X) + log(/)C } 



For short, we write AIC and 610 for the two obtained estimators AIC 911(1 ^BIC - 

Although AIC 

and BIC procedures are not justified in this setting, we still apply them as they are widely used 
in many frameworks. Their computation is performed efficiently using the fast Fourier transform 
described in Section [231 



The experiments are repeated 1000 times. The Gaussian field is simulated using the fast Fourier 
transform. The quality of the estimations is assessed by the prediction loss function £(., .). For any 
(f> and any of these three estimators, we evaluate the risks Eg^,[l(9 AIC ,9^)}, Eg^f^? 810 , 0^)], and 
E e <f,[l(9™ ,9^)] as well as the corresponding empirical 95% confidence intervals by a Monte-Carlo 
method. We also estimate the risk ratios E e4 , [1(9™, )]/E e « [1(6™* , 6*)] The results are given in 
Tabled 
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(j) x 10 2 





1.25 


1.5 


1.75 


E^(6> AIC ,6^)] x 10 2 


1.2 ±0.2 


3.1 ±0.2 


4.3 ±0.2 


6.4 ±0.2 


E^[Z(^ IC ,^)] x 10 2 


0.01 ±0.01 


1.9 ±0.1 


3.7 ±0.1 


9.7 ±0.3 


[Z(6» iso , 6>^)] x 10 2 


1.6 ±0.2 


3.2 ±0.2 


4.2 ±0.1 


7.2 ±0.3 


E fl 4Z(0*°,0*)]/E fl *[i$£,0*)] 


+oo 


1.9 ±0.7 


1.3 ±0.2 


1.5 ±0.3 



Table 2: Second simulation study. Estimates and 95% confidence^ intervals of the risks 
E„* [Z(^ AIC ,6»^)], E e „ [lie 310 , B*)\, and E e , [l(9 iso , 0*)] and of the ratio E e0 [Z((9 iso , 0+)]/E e + [l@%>. , 0*)]. 



The BIC criterion outperforms the other procedures when = 0, 0.0125, or 0.015 but behaves 
bad for a large (j). Indeed, the BIC criterion has a tendency to overpenalize the models. For the 
two first values of <f) the oracle model in M is too. Hence, overpenalizing increases the performance 
of estimation in this case. However, when <f> increases, the dimension of the oracle model is larger 
and BIC therefore selects too small models. 

In contrast, AIC and the CLS estimator exhibit similar behaviors. If we forget the case = 
for which the oracle risk is 0, the risk of 9 lso is close to the risk of the oracle model (the ratio is 
close to one). Hence, the neighborhood choice for 6 1S0 is almost optimal. 

In conclusion, 6> lso or 8 Alc both exhibit good performances for estimating the distribution of a 
regular Gaussian field on a torus. The strength of our neighborhood selection procedure lies in the 
fact it easily generalizes to non-toroidal lattices as illustrated in the next section. 

6.2 Isotropic Gaussian fields on 1? 

First simulation experiment. We now consider X an isotropic Gaussian field defined on Z 2 but 
only observed on a square A of sizes p = p\ = pi = 20 or p = p\ = p-2 = 100. This corresponds to 
the setting described in Section [H The variance of X[o,o] is set to one and the distribution of the 
field is therefore uniquely defined by its correlation function p(k, I) := corr(X[fc,z], X[o,o]). Again, the 
number of replications n is chosen to be one. In the first exper iment, we use four clas sical cor relation 
functions: exponential, spherical, circular, and Matern (e.g. |Cre93| Sect. 2. 3.1 and [Mat86l p. 



Exponential: p(k, I) = 

Circular: p(k, I) = 

Spherical: p(k, I) — 

Matern: p(k, I) = 




if d(k,l)<r 
else 



where d(k,l) denotes the euclidean distance from (k,l) to (0,0) and JC K {.) is the modified Bessel 
function of order k. In a nutshell, the parameter r represents the range of correlation, whereas k 
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may be regarded as a smoothness parameter for the Matern function. In this simulation experiment, 
we set r to 3. When considering the Matern model, we take K equal to 0.05, 0.25, 0.5, 1, 2, a nd 4. 

The Gaussian fields are simulated using the function GaussRF in the library RandomFields 
For each of experiments, we compute the estimator 1SO based on Algorithm 15.11 with the collection 
M. := {m G M.\ , df° < 18}. Since the lattice A is not a torus, methods based on likelihood 
maximization exhibit a prohibitive computational burden. Consequently, we do not use MLE in 
this experiment. We shall compare the efficiency of 1SO with a variogram-based estimation method. 

We recall that the linear combination j)eA\{(o o)} ^['•jl^'Ki] is the kriging predictor of X[o,o] 
given the remaining variables (Equation (J)). A natural method to estimate in this spatial setting 
amounts to estimating the variogram of the observed Gaussian field and then performing ordinary 
kriging at the node (0,0). More precisely, we first estima te the empirical variogram by applying 
the modulus estimator of Hawkes and Cressie (e.g. [Cre93| Eq. (2.2.8)) to the observed field of 400 
points. Afterwards, we fit this empiri cal variogram to a variogram model using the reweighted least- 
squares suggested by Cressie |Cre85| . This procedure therefore requires the choice of a particular 
variogram model. In the first simulation study, we choose the model that has generated the data. 
Observe that this method is not a daptive since it requires the knowle dge of the variogram model. 
In practice, we use Library geoR RJD01 | implemented in R R D08| to estimate the parameters 



r, var(X[o,o]) and eventually K of the variogram model. Then, we compute the estimator V by 
performing ordinary kriging at the center node of A. For each of these estimations, we assume 
that the variogram model is known. For computational reasons, we use a kriging neighborhood of 
size 11 x 11 that contains 120 points. Previous simulations have indicated that this neighborhood 
choice does not decrease the precision of the estimation. For the Matern model with k — 2 and 4, 
the covariance is almost singular. There are sometimes inversion difficulties and we therefore use 
kriging neighborhood of respective size 7x7 and 3x3. 

We again assess the performances of the procedures using the loss l(., .). Even if this loss is de- 
fined in fTTj) for a torus, the alternative definition lfT2|) clearly extends to this non-toroidal setting. 
Consequently, the loss 1(0,0) measures the difference between the prediction error of X[o,o] when 
using S(i 3 -) e A\{(o,o)} ^Wl-^W] an d the prediction error of X[o,o] when using the best predictor 
E[X[o,o]|(X[ij])(,.j) e A\{(o,o)}]- In other words, 1(0,0) is the difference of the kriging error made with 
the estimated parameters and the kriging error made with the true parameter 0. 

The experiments are repeated 1000^ times. For any of the four correlation models previously 
mentioned, we evaluate the risks Eg[l(0 mo , 0)\ and Eg[Z((r ,#)] by Monte-Carlo. In order to assess 
the efficiency of the selection procedure, we also evaluate the risk ratio 

E e [Z(0^' iBO ,0)] 

Risk.ratio 



As in Section l6~!T| the oracle risk K[l(0^fi t ' lso , 0)] is evaluated by taking the minimum of the evalua- 
tions of the risks E[l(0^ M ' iso ,0)} over all models m G A4. Results of the simulation experiment are 
given in Table [3] and HI 

Observe that none of the fields considered in this study are GMRFs. Here, the GMRF models 
should only be viewed as a collection of approximation sets of the true distribution. This simulation 
experiment is in the spirit of Rue and Tjelmeland's study RT02| . However, there are some major 



differences. Contrary to them, we perform estimation and not only approximation. Moreover, our 
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lattice is not a torus. Finally, we use our prediction loss £(., .) to assess the performance, whereas 
they compare the correlation functions. 



Model 


Exponential 


Circular 


Spherical 


E e [l(e y ,9] x 10 2 


0.08 ±0.01 


9.1 ±0.5 


2.9 ±0.1 


E g [l(9 iso ,9)} x 10 2 


1.08 ±0.01 


6.5 ±0.1 


3.4 ±0.1 


Risk.ratio 


3.6 ±0.4 


1.4 ±0.1 


1.6 ±0.1 



Table 3: Estimates and 95% confidence intervals of the risks £^(0^,0)] and E g [l(9 iso ,9)} and of 
Risk.ratio for the exponential, circular and spherical models with p = 20. 



K 


0.05 


0.25 


0.5 


1 


E [l(9 v ,9)} x 10 3 


91.8 ±0.7 


80.0 ±0.2 


18.0 ±0.1 


2.5 ±0.1 


E e [l(9 iso ,9)} x 10 3 


2.24 ±0.01 


0.62 ±0.01 


0.33 ±0.01 


0.08 ±0.01 


Risk.ratio 


1.3 ±0.1 


1.7 ±0.2 




1.5 ±0.2 


1.3 ±0.1 




K 


2 




4 




E e [l(9 v ,9)} x 10 4 
E e [l(9 iso ,9)} x 10 4 
Risk.ratio 


6.3 ± 1.1 
1.9 ±0.1 
2.6 ±0.2 


0.011 ±0.001 
0.17 ±0.01 
1.1 ±0.1 





Table 4: Estimates and 95% confidence intervals of the risks E^fl^fl)] and E g [l(9 iso ,9)} and of 
Risk.ratio for Matern model with p = 100. 

Comments on Tables and gj In both tables, the ratio E e [l(9^ M ' iso ,9)}/E [l(9^' iso ,9)} stays 
close to one. Hence, the model selection is almost optimal from an efficiency point of view. In most 
of the cases, the estimator 9 ,so outperforms the estimator based on geostatistical methods. This 
is particularly striking for the Matern correlation model because in that case the computation of 
W requires the estimation of the additional parameter k. Indeed, let us recall that the exponential 
model and the Matern model with K = 0.5 are equivalent. For K = 0.5, the risk of £r is 100 times 
higher when K has to be estimated than when k is known. 

Second simulation experiment. The kriging estimator 0^ requires the knowledge or the 
choice of a correlation model. In the second simulation experiment, the correlation of X is the 
Matern function with range r = 3 and k = 0.05. The size p of the lattice is chosen to be 100. We 
now estimate 9 using different variograrn models, namely the exponential, the circular, the spherical 
and the Matern model. The estimator 9 1S0 for such a field was already considered in Tabled The 
experiment is repeated 1000 times. 

Comments on Table [H One observes that circular and spherical models yield worse perfor- 
mances than Matern model. In contrast, the exponential model behaves better. The choice of the 
variograrn model therefore seems critical to get good performances. The model selection estimator 
9 1S ° (Table HI exhibits a smaller risk than the exponential model. 
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Model 


Exponential 


Circular 


Spherical 


Matern 


Eg[l(0 V ,e)} X 10 3 


48.3 ±0.4 


461 ± 16 


293 ±7 


91.8 ±0.7 



Table 5: Estimates and 95% confidence intervals of the risks Eg[l(9 , 9)] for Matern model with 
K = 0.05 when using the exponential, circular, spherical, and Matern models with p = 100. 



6.3 Anisotropic Gaussian fields on Z 2 

We still consider X a Gaussian field observed on a square A of size 100 x 100. Contrary to the 
previous study, the field is not assumed to be isotropic. To model the geometric anisotropy, we 
suppose that X is an isotropic field on a deformed lattice A'. The transformation consists in multi- 
plying the original coordinates by a rotation R and a shrinking matrix T. For the sake of simplicity, 
we take the identity for R. The shrinking matrix T is defined by the anisotropy ratio (Ani. ratio). 
It corresponds to the ratio between the directions with smaller and greater continuity in the field 
X , i.e the ratio between maximum and minimum ranges. In this experiment, X follows a Matern 
correlation with rangej = 3, k = 0.05, 0.25, 0.5, 1, 2, and 4 and Ani.ratio=2 or 5. We compute the 
anisotropic estimator 9 based on Algorithm l5.il with the collection M := {m € M.\,d m < 28}. As 
a benchmark, we also compute the variogram-based estimator W based on the Matern model. In 
order to compute W , we assume that we know the anisotropy ratio and the anisotropy directions. 
Observe that the estimator 9 does not require any assumption on the form of anisotropy, while £r 
uses the geometric parameters of the anisotropy. 

The experiments are repeated 1000 times. We evaluate the risks Eg[i(^ ,9)] and Eg[l(9,9)] and 
the risk ratio defined by 

Risk.ratio = ^ ^J? — - — — . 

Eg[l(9^,e)] 



K 


0.05 


0.25 


0.5 


1 


Eg[l(9 v 1 9)} x 10 2 


15.8 ±0.1 


13.9 ±0.1 


3.3 ±0.1 


0.30 ±0.01 


Ee[i(e,e)} x io 2 


0.65 ±0.01 


0.20 ±0.01 


0.089 ±0.001 


0.17 ±0.01 


Risk.ratio 


1.2± 0.1 


1.1 ±0.1 


1.1 ± 0.1 


1.7±0.2 




K 


2 


4 




E e [l(9 v ,9)} x 10 4 


9.8 ±0.1 


0.020 ±0.001 




Eg[l 


9' mo ,9)} x 10 4 


45.0 ±0.1 


4.3 ±0.1 




Risk.ratio 


2.9 ±0.2 


22.3 ± 1.7 





Table 6: Estimates and 95% confidence intervals of the risks Eg^ffi ,9)] and Eg[l(9,9)] and of 
Risk.ratio for Matern model and Ani. ratio— 2. 

Comments on Tables^ and [3 Except for the cases k = 2,4, the estimator 9 performs better 
than the variogram-based estimator V ' J although 9^ uses the true anisotropy parameters. For 
K = 4, the neighborhood selection is no performed efficiently (the risk ratio is large). 
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K 


0.05 


0.25 


0.5 


1 


E g [i(e v ,9)} x io 2 


11.2 ±0.1 


14.9 ±0.1 


3.7 ±0.1 


2.9 ±0.1 


E e [l{9,9)} x IO 2 


0.66 ±0.1 


0.40 ±0.01 


0.081 ±0.001 


0.14±0.01 


Risk.ratio 


1.1 ±0.1 


1.1 ±0.1 


1.2 ±0.1 


3.4 ±0.8 



K 


2 


4 


Eg[l(0 V ,0)] x 10 4 


30.6 ±0.1 


0.22 ±0.01 


Eg[l(9' lso ,9)} X 10 4 


38.0 ±0.1 


39.6 ±0.1 


Risk.ratio 


2.1 ±0.1 


9.0 ± 1.4 



Table 7: Estimates and 95% confidence intervals of the risks E^Z^, 0)] and Eg[l(6,0)] and of 
Risk.ratio for Matern model and Ani.ratio= 5. 



7 Discussion 



In this paper, we have extended a neighborhood selection procedure introduced in [Ver09j . On the 



one hand, an algorithm is provided for tuning the penalty in practice. On the other hand, the new 
method also handles non-toroidal lattices. The computational complexity remains reasonable even 
when the size of the lattice is large. 

In the case of stationary fields on a torus, our neighborhood selection procedure exhibits a 
computational burden and statistical performances analogous to the AIC procedure. Even if AIC 
has not been analyzed from an efficiency point of view, this suggests that AIC may achieve an 
oracle inequality in this setting. Moreover, we have empirically checked that 9 performs almost as 
well as the oracle model since the oracle ratio E[l(6, 6)]/E[l(9 m *, S)\ remains close to one. 

The strength of this neighborhood selection procedure lies in the fact it easily extends to non- 
toroidal lattices. We have illustrated that our method often outperforms variogram-based estimation 
methods in terms of the mean-squared prediction error. Moreover, the procedure behaves almost as 
well as the oracle. In contrast, variogram-based procedures may perform well for some covariances 
structure but also yield terrible results for other covariance structures. These results illustrate the 
adaptivity of the neighborhood selection procedure. 



In many statistical applications, Gau ssian fi elds (or Gaussian Marko v random fields) are not 
directly observed. For instance, Aykroyd Avk98| or Dass and Nair DN03| use compound Gaussian 
Markov random fields to account for non stationarity and steep variations. The wavelet transform 
has emerged as a powerful tool in ima ge analysis. The wa velet coefficients of an image are sometimes 
modeled using hidden Markov models [CNB98I . IPSWSOSl ] . More genera lly, the su ccess of the GMRFs 
is mainly due to the use of hierarchical models involving latent GMRFs The study and the 

implementation of our penalization strategy for selecting the complexity of latent Markov models 
is an interesting direction of research. 



8 Proofs 

Let us introduce some notations that shall be used throughout the proofs. For any 1 < k < n, the 
vector X]! denotes the vectorialized version of the k-th sample of X. Moreover, X v is the matrix 
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of size P1P2 x n of the n realisations of the vector X£. Throughout these proofs, L, L\, L2 denote 
constants that may vary from line to line. The notation L(.) specifies the dependency on some 
quantities. Finally, the 7(.) function stands for an infinite sampled version of the CLS criterion 
7n,pi,p 2 (-): 7(0 :=E[7n, Pl , P2 (-)]- 



8.1 Proof of Lemma l2Jl 

Let us provide an alternative expression of 7„ iPl ,p 2 (6*') in term of the factor C{9') and the empirical 
covariance matrix X V X V * . 

ln, PuP2 {6') = -^—tr[(I PlP2 -C(9'))X^^(I PlP2 -C(9'))] . (17) 
This is justified in [Ver09| Sect. 2.2. 

Lemma 8.1. There exists an orthogonal matrix P which simultaneously diagonalizes every P1P2 x 
P1P2 symmetric block circulant matrices with P2 x P2 blocks. Let 9 be a matrix of size p\ x p 2 such 
that C(9) is symmetric. The matrix D{9) = P*C(9)P is diagonal and satisfies 

P l P2 

D(9)[(i-l)p 2 +j,(i-l)p2+j] 

k=l 1=1 

for any 1 < i < p\ and 1 < j < pi . 

This lemma is proved as in [RH05] Sect. 2. 6. 2 to the price of a slight modification that takes 
into account the fact that P is orthogonal and not unitary. The difference comes from the fact 
that contrary to Rue and Held we also assume that C{9) is symmetric. Lemma [8.11 states that 
all symmetric block circulant matrices are simultaneously diagonalizable. Observe that for any 
1 < i < Pi and 1 < j < p 2 , it holds that D(9)[(i-i) P2 +j,(i-^)P2+j] = \[i,j](9) since 9[k,i] = 9[ Pl -k, P2 ~i]. 
Hence, Expression fl7|l becomes 

-, r Pl P2 r n 

7™ 2 (0') = OZ^[1-AM(0)] 2 Y, [ P ^t(^lT P ] K*-i)p a +i,(i-i) Pa +i] 

TIP1P2 I11 > 1 

v t=l j = l u fe=l 

where X£! is the vectorialized version of the fc-th observation of the field X. Straightforward 
computations allow us to prove that the quantities 

(P*X£(X£)*P) i(i-i) P2+j ,(i-i) P2+ j] + (P*X£(X£)* P) {[ p ^-i-i) P2+P2 -j,{ Pl -i-i) P2+P2 -j] 

and 

1 . T^T 1 



A[«>j](Xfc)A[i,j](X^) H — -^=\[ P i-i, P2 -j](X. k )\\p 1 -i, P2 -j](X k ') 



y/PlP2 

are equal for any 1 < i < p\ and 1 < j < P2- Here, the entries of the matrix A(.) are taken 
modulo pi and P2 and the entries of [P*X^(X^!)*P] are taken modulo pip2- The result of Lemma 
[2J] follows. 



INRIA 



Neighborhood selection 



21 



8.2 Proof of Proposition 14.11 



Proof of Proposition ^ -1\ We only consider the anisotropic case, since the proof for isotropic esti- 
mation is analogous. For any model m e Mi, we define 

A(m,m') := 7„, p ,p (o m ,p) + pen(m) - "f n , P , P \ 9m',p) - pen(m') . 

We aim at showing that with large probability, the quantity A(m, m!) is positive for all small 
dimensional models m. Hence, we would conclude that the dimension of m is large. In this regard, 
we bound the deviations of the differences 



n 7 p,p \ w m,p j 



+ 



(0)- 



n,p,p \vfn,p 



) - 7n,p,p (0) 



Lemma 8.2. Let K 2 be some universal constant that we shall define in the proof. With probability 
larger than 3/4, 

K 2 2 ^ d m V 1 



ln,p,p{9) — ln,p.p{9m,p) < — P fmaxC^)- 

2 np* 



ano 



dn 

np* 



for all models m £ M.\ 



Lemma 8.3. Assume that p is larger than some numerical constant po. With probability larger 
than 3/4, it holds that 

ln,p,p( S ) ~ ln, v ,p{6m- ,p) > K 3 (T 2 {lf min (l p 2 - C{6)) A [p - if max (l p 2 - C(6))] } ^ , 

where K3 is a universal constant defined in the proof. 

Let us take K\ to be exactly K 3 . Gathering the two last lemma with Assumption lfl5|) . there 
exists an event of probability larger than 1/2 such that 



A(m, m) > 



np z 



K lV d m , [ Vmin - C{6)) A [p- ^ raax (7 p2 - C(0))]] - ^2 y J n (7^_c (9)) 



for all models rn G .Mi. Thus, conditionally to ft, A(m,m') is positive for all models m G .Mi that 
satisfy 



d m V 1 



< 



r^min (V - {<^ mi n (ip* - C(0)) A [p - </> max (I p 2 - C(0))] } 



By Lemma 8.7 in [VerD^ . the dimension d m ' is larger than 0.5[\/np 2 A (p 2 — 1)]. We conclude that 



<fo„Vl > 



np 2 A p — 1 



with probability larger than 1/2. 



(Pmin (/p2 - C(0)) {<p miB (I P 2 - C{6)) A [p - <^ max (7 p2 - C{9))] } , 



□ 
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Proof of Lemma \8.2\ In the sequel, 7 n „ „(-) denotes the difference jn tP)P (.) -"((■)■ Given a model 
to, we consider the difference 

7n,p,p (0) ~ ln,p,p (<W) = 7„,p,p (°) - ln,p,p (<W) - l(0 m , P , 0) . 

Upper bounding the difference of J n , p , p therefore amounts to bounding the difference of J„ tPp . By 
definition of 7n, p , P and 7, it expresses as 

in,p,p {0) - ln,p, P (9m, P ) = ±tr { [(V - c{9)f - (V - c(9 m , p )) 2 } (X^X^ - E) } . 

The matrices E, (J p 2 — C(6)), and (J p 2 — C(9 m , P )) are symmetric block circulant. By Lemma l8TT| 
they are jointly diagonalizable in the same orthogonal basis. If we note P an orthogonal matrix 
associated to this basis, then C{6 m ,p), C(9), and E respectively decompose in 

C{9 m<p ) = P*D{6 miP )P , C{6) = P*D(9)P and E = P*D{H)P , 

where the matrices D(9, n , p ), D(9), and -D(E) are diagonal. 

7n,p,p (0) — (® m -.p) = 

^tr {(D(9 m , p ) - D(9)) [2I p 2 - D(9) - D(9„ hP )] (YW - I p 2)} , (19) 



where the matrix Y is defined as Pv / E _1 X' u P*. Its components follow independent standard 
Gaussian distributions. Since the matrices involved in (fl9|) are diagonal, Expression (fl9|) is a 
linear combination of centered \ 2 random variables. We apply the following lemma to bound its 
deviations. 

Lemma 8.4. Let (Y±, . . . , Yd) be i.i.d. standard Gaussian variables. Let ax,...,an be fixed num- 
bers. We set 

D 

\\a\\oo ■= sup |oi|, \\a\\l -.= ^2^ 
i=i,..,D l=1 

Let T be the random variable defined by 

T:=f>(l?-l) . 

i=l 

Then, the following deviation inequality holds for any positive x 

P [T > 2||o|| 2 v^ + 2||a|Ua;] < e~ x . 

This result is very close to Lemma 1 of Laurent and Massart in |LM00| . The only difference 
lies in the fact that they constrain the coefficients a, to be non-negative. Nevertheless, their proof 
easily extends to our situation. Let us define the matrix a of size n x p 2 as 



a 



D E[i ,i] (D(0 m , p )[i,n - D(6)[i,i]) (2-£>(0[i,i] - D(0 m>p )[i,i}) 



np 2 
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for any 1 < i < n and any 1 < j < p 2 . Since the matrices I — C(8) and I — C(6 m , p ) belong to 
the set 8+ , their largest eigenvalue is smaller than p. By Definition fTTj) of the loss function .), 
\\ah ^ 2/9- v /^ max (i;)/(0 m ,p, 0)/ (np 2 ) and Halloo < 4 / 9 : V max (£)/(np 2 ). By Applying Lemma[83]to 
Expression (fT9| . we conclude that 



7„, P , P W " 7„,p,p (0m,„) > 6) + 12p 2 ^4^: 



for any x > 0. Consequently, for any K > 0, the difference of 7 n ,p,p(-) satisfies 

{ o \ fn \ ^ ^ 2 /p\ dm V 1 

ln,p,p{V) — ln,p,p(t>m,p) S — P ImaxOJ 2 , 

simultaneously for all models m e .Mi with probability larger than 1 — YlmeMA® e _K ^ dmV1 )/ 24 . If 
A' is chosen large enough, the previous upper bound holds on an event of probability larger than 
7/8. Let us call K 2 such a value. 

Let us now turn to the second part of the result. As previously, we decompose the difference of 
empirical contrasts 

7«,P,P i@m,p) ~ In.p.p (j?m,p^ = 7n,p,p (fim,p) ~~ 7n,p,p (®™,i> \ ~ ^ {@m,pi @m,p^ 

Ar guing a s in the proof of Theorem 3.1 in Ver09j . we obtain an upper bound analogous to Eq.(49) 
in [VerOfll l 



7„,p,p {0m,p) ~ 7„,p,p (0 m ,p) < l@m,p, m ,p) + p 2 ( Sup \tr [RD^ (YY* - I p2 )] } . 

L ReB S m 2 P > 

The set B^ 2 m i is defined in the proof of Lemma 8.2 in Its precise definition is not really 

of interest in this proof. Coming back to the difference of 7 n , p , p (.), we get 

7n,p,p {dm.p) ~ ln,p,p (#m.p) < p 2 \ SUp -Jx [i?-D E (YY* - I p2 )] 
V ' Rem'. - P 



We consecutively apply Lemma 8.3 and 8.4 in to bound the deviation of this supremum. 

Hence, for any positive number a, 



7«,p,p (#m,p) - 7n,p,p (#m,p) < + Ct/2)p 2 ip max (£) ^ . (20) 



with probability larger than 1 — exp[— L 2 v /a m("y==yj A 1^75 )]• Thus, there exists some numerical 

constant ctQ such that the upper bound (|20| with a = a® holds simultaneously for all models 
m G .Mi \ with probability larger than 7/8. Choosing A' 2 to be the supremum of K' 2 and 
2Li(l + ao/2) allows to conclude. 

□ 
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Proof of Lemma\8jJ[ Thanks to the definition ifTTj) of J n ,p,p(-) we obtain 

ln, P A e )-"(n,p, v @™',p) = \ SU P & [{C(6') - C(0)) (2I P 2 - C{6) - C{6')) TTZF\ , 

ret' ,p 

where the p 2 x n matrix Z is defined by Z := VT, X v . We recall that the matrices S, C{9) and 
C{6') commute since they are jointly diagonalizable by Lemma [8.11 Let (O^, . — 9) be the set 
9+, p translated by 9. Since C{9) + C(8') = C{9 + 0'), we lower bound the difference of ^f n ,p,p(-) as 
follows 

7«, P , P (0) - 7n,p,p0m', P ) = \ sup 2a 2 tr [C{9')TZF\ - tr [C{9') 2 YWZF\ 

2 

> °- sup {2tr [C(9')Z2F\ - ip~l [7 p3 - C{6)] tr [C(9') 2 Z7F] } 
P 

Let us consider . . . , "J^ j d ; a basis of the space m < defined in Eq.(14) of [VerOd |. Let a 

be a positive number that we shall define later. We then introduce 9' as 

d m , 

p fc=l 

Since is assumed to belong to 6+, , the parameter 9' belongs to — 9) if 

^max [C(0')] < <Pmi„ (V " C '( )) and <Pm\n[C{&)] > -P + (V-C((9)) . 

. The largest eigenvalue of C(0') is smaller than ||0'||i whereas its smallest eigenvalue is larger than 
— ||0'||i. Let us upper bound the l\ norm of & : 

d m , 

\\9'\\x = 2 ( p min [l p2 -C(e)]^J^\tr[C^ iktjk )Z^\\ 

P k=l 



< 2j-<p min [I p2 - C(9)] d m dr [C{9')TZF\ . (21) 



Hence, & belongs to (9+, p - 9) if 

ll^'lll < ^min (V ~ C ( 9 )) A [P - ¥W (V ~ C ( 6 ))] ■ ( 22 ) 

Thus, we get the lower bound 

2 

Jn,pA0) - ln,p,p0m', P ) > ^ {2tr [C(9')Z7F] - <p^ n [I P 2 - C{9)] tr [C(9') 2 Z7F] } , (23) 



as soon as Condition l|22|) is satisfied. 

Let us now bound the deviations of the two random variables involved in (j2Tj) and (j23j) by 
applying Markov's and Tchebychev's inequality. For the sake of simplicity, we assume that d m ' is 
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smaller than (p 2 — 2p)/2. In such a case, all the nodes in m! are different from their symmetric in 
A. We omit the proof for d m > larger than (p 2 — 2p)/2 because the approach is analogous but the 
computations are slightly more involved. Straightforwardly, we get 

E [tr (C(9')ZW)] = 4c^ min [l p , - C{9)] ^f - , 

since the neighborhood m' only contains points whose symmetric is different. A 

cumbersome but pedestrian computation leads to the upper bound 

var [tr (C(6')Z^)] < L ia 2 ^ 2 min [l p , - C{9)] ^ , 

where L\ is a numerical constant. Similarly, we upper bound the expectation of tr [C(9') 2 ZZ*\ 

E [tr (C(9') 2 ZW)] < L 2 a^ min [l p > C(6)] *f . 

Let us respectively apply Tchebychev's inequality and Markov's inequality to the variables tr [C(9')ZZ* 
and tr [C(9') 2 ZZ*\ . Hence, there exists an event ft of probability larger than 3/4 such that 

2tr [C(e')ZF] - Vm \ a [V - C(fl)] tr [CtffZF] > 



,„,4V-^)]^{8a(l-j£-)-a^} 



and 



tr [C(9')ZZF\ < 4«^ min [I P 2 - C(9)] 



d m ' ( ^ ^ / L[ 

V dm' 



In the sequel, we assume that p is larger than some universal constant po, which ensures the 
dimension d m i to be larger than kL\. Gathering lj2lj) with the upper bound on tr [C(#')ZZ*] yields 

||0'||i < 2V2a^ min [I P 2 - C{9)] ^p= < 2V2a^ min [I p , - C{9)] , 

V np 

since d m i < p^/n. If 2\[2a is smaller than 1 A { [p — </3 max {l p 2 — C(0))} [l p 2 — C(fi)Y\ , then 
Condition 11221) is fulfilled on the event f2 and it follows from 11231) that 



{ln, P , P (8) - ln,p,p@m', P ) > [V - C(6)] ^ [a - a 2 L' 2 /A\ } > | 

-ymax(/ p 2-C(e)) 

4 Vmi „(/ p 2-c(e)) 



Choosing a = % A £ A ^Z^^C^ , we get 



Jn, P ,p{°) ~ ln,p, P {9 m ' , p ) > K 3 a 2 {(p min [I P 2 - C{9)] A[p- ip max (l p 2 - C{9))] } ^ } ^ 

np J 



dm' 1^3 
4 " 



where K3 is an universal constant. □ 
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